I have chosen to explore the Prosper Loan Dataset. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
Here’s a link to a page with definitions for the variables used in this dataset: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0
First, I’m going to run a few summaries to try and get a feel for my data. I’ll see what it looks like in general, and look for anything that stands out to me. This dataset has a lot of observations so there should be plenty of options.
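Here’s a rough sketch of how a summary like the one below could be produced. It isn’t necessarily the exact code behind this report: the file name `prosperLoanData.csv` and the helper `load_loans` are assumptions made for illustration. Text columns are read as factors so that `summary()` reports counts per level, as the output does.

```r
# Hypothetical helper: read the loan data with text columns as factors,
# so that summary() reports level counts for categorical variables.
load_loans <- function(path) {
  read.csv(path, stringsAsFactors = TRUE)
}

# Assumed file name; adjust to wherever the CSV actually lives.
loan_path <- "prosperLoanData.csv"
if (file.exists(loan_path)) {
  df <- load_loans(loan_path)
  print(dim(df))      # number of rows and columns
  print(summary(df))  # per-variable summaries
}
```

Since R 4.0, `read.csv` no longer converts strings to factors by default, which is why `stringsAsFactors = TRUE` is spelled out here.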
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
I knew there were a lot of variables, but looking through that summary gave me a better idea of just how many there are, and it’s a bit overwhelming. For now I’m going to pick out a more manageable number of variables that look interesting and run some analysis on them; I can add more back later if I’m so inclined.
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
Out of those 81 variables, the ones I decided to keep are summarized below. Next I’ll subset my df and run some more summaries.
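Subsetting by column name can be sketched like this. The helper `keep_columns` and the toy data frame are hypothetical stand-ins, not the code or data used in this report; the point is just that selecting by name and failing loudly on a misspelled column is safer than selecting by position.

```r
# Hypothetical helper: keep only the named columns, failing loudly on typos.
keep_columns <- function(data, cols) {
  missing <- setdiff(cols, names(data))
  if (length(missing) > 0) {
    stop("Columns not found: ", paste(missing, collapse = ", "))
  }
  data[, cols, drop = FALSE]
}

# Toy stand-in for the full data frame (values are made up).
toy <- data.frame(
  LoanNumber = 1:3,
  Term       = c(12, 36, 60),
  Investors  = c(1, 44, 80)
)

toy_small <- keep_columns(toy, c("LoanNumber", "Term"))
print(names(toy_small))
```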
## LoanNumber CreditGrade Term
## Min. : 1 :84984 Min. :12.00
## 1st Qu.: 37332 C : 5649 1st Qu.:36.00
## Median : 68599 D : 5153 Median :36.00
## Mean : 69444 B : 4389 Mean :40.83
## 3rd Qu.:101901 AA : 3509 3rd Qu.:36.00
## Max. :136486 HR : 3508 Max. :60.00
## (Other): 6745
## LoanStatus BorrowerAPR BorrowerRate
## Current :56576 Min. :0.00653 Min. :0.0000
## Completed :38074 1st Qu.:0.15629 1st Qu.:0.1340
## Chargedoff :11992 Median :0.20976 Median :0.1840
## Defaulted : 5018 Mean :0.21883 Mean :0.1928
## Past Due (1-15 days) : 806 3rd Qu.:0.28381 3rd Qu.:0.2500
## Past Due (31-60 days): 363 Max. :0.51229 Max. :0.4975
## (Other) : 1108 NA's :25
## LenderYield ProsperScore BorrowerState
## Min. :-0.0100 Min. : 1.00 CA :14717
## 1st Qu.: 0.1242 1st Qu.: 4.00 TX : 6842
## Median : 0.1730 Median : 6.00 NY : 6729
## Mean : 0.1827 Mean : 5.95 FL : 6720
## 3rd Qu.: 0.2400 3rd Qu.: 8.00 IL : 5921
## Max. : 0.4925 Max. :11.00 : 5515
## NA's :29084 (Other):67493
## Occupation IsBorrowerHomeowner
## Other :28617 False:56459
## Professional :13628 True :57478
## Computer Programmer : 4478
## Executive : 4311
## Teacher : 3759
## Administrative Assistant: 3688
## (Other) :55456
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts TotalInquiries
## Min. : 2.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.: 4.00 1st Qu.: 2.000
## Median : 25.00 Median : 6.00 Median : 4.000
## Mean : 26.75 Mean : 6.97 Mean : 5.584
## 3rd Qu.: 35.00 3rd Qu.: 9.00 3rd Qu.: 7.000
## Max. :136.00 Max. :51.00 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent IncomeVerifiable
## Min. : 0.0000 Min. : 0.0 False: 8669
## 1st Qu.: 0.0000 1st Qu.: 0.0 True :105268
## Median : 0.0000 Median : 0.0
## Mean : 0.5921 Mean : 984.5
## 3rd Qu.: 0.0000 3rd Qu.: 0.0
## Max. :83.0000 Max. :463881.0
## NA's :697 NA's :7622
## StatedMonthlyIncome MonthlyLoanPayment
## Min. : 0 Min. : 0.0
## 1st Qu.: 3200 1st Qu.: 131.6
## Median : 4667 Median : 217.7
## Mean : 5608 Mean : 272.5
## 3rd Qu.: 6825 3rd Qu.: 371.6
## Max. :1750003 Max. :2251.5
##
The first thing that stands out to me here is the dramatic difference between the median and max values for most of the credit-related variables. Open Revolving Accounts has a median of 6 and a max of 51. Total Inquiries has a median of 4 and an astounding max of 379! I’ll definitely look at that a little more closely. Also, our credit grade seems to be mostly blank values, so that’ll be a problem later.
As I suspected from the summary, there must be very few observations far above the third quartile, perhaps only one, because they’re not even showing up on this chart. It’s possible that there’s some erroneous data, or perhaps there’s one guy who really, really likes making credit inquiries. Either way, this doesn’t help me learn anything about most of my data, so next I’ll get a closer look at my data without those outliers. First I’ll just guess at some limits based on my observations from this plot.
This is already much more interesting: we can see what our distribution actually looks like. Next I’ll run some more summary statistics to get a better feel for it, and probably build one more plot based on that information.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 5.584 7.000 379.000 1159
In the above summary for total inquiries we can see that our third quartile is 7 despite a max of 379. Next I’ll look at the 95th percentile.
## 95%
## 16
It looks like 95% of our observations have 16 or fewer inquiries. I’ll build one more plot on this data using that limit.
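The idea of capping a plot at the 95th percentile can be sketched as follows. This is an illustration in base R, not the plotting code used for this report: `trim_above_quantile` is a hypothetical helper and the inquiry counts are made-up toy values.

```r
# Hypothetical helper: drop NAs and any values above the p-th percentile.
trim_above_quantile <- function(x, p = 0.95) {
  cutoff <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x <= cutoff]
}

# Toy inquiry counts (made up): mostly small, with one extreme value.
toy_inquiries <- c(0, 1, 2, 2, 3, 4, 4, 5, 7, NA, 379)
trimmed <- trim_above_quantile(toy_inquiries, 0.95)

# Compute one-unit-wide histogram bins without opening a graphics device.
h <- hist(trimmed,
          breaks = seq(-0.5, max(trimmed) + 0.5, by = 1),
          plot = FALSE)
print(h$counts)
```

On the real data the same cutoff would come from something like `quantile(df$TotalInquiries, 0.95, na.rm = TRUE)`, which is consistent with the `95%` value printed above.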
Next I just want to see if the other credit metrics follow the same pattern. It would make sense for the number of credit accounts to be distributed the same way that inquiries are.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 9.00 9.26 12.00 54.00 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 6.00 6.97 9.00 51.00
The graphs for Open Credit Lines and Open Revolving Accounts are very similar to each other, but they do not match the total inquiries as well as I thought they might. They are not nearly as heavily right-skewed; that is, their values are not piled up as tightly at the low end.
This is interesting: it looks like most people have very few delinquencies, again with a small number of outliers. Let’s refine this plot a bit.
Looks like most people in this data set don’t have any current delinquencies, so WTG most people! Next we’ll look at credit ratings. First, just a quick bar chart to see what we’ve got. The previous variables have all been numeric, so histograms were used; credit grade is categorical, so I’ll be using a bar chart.
Now that blank data for Credit Grade becomes a problem. I looked at the definitions, and it turns out we only have this data for pre-2009 records. Because I don’t have any way of acquiring the missing values, for now I’ll just redo this plot without them to get a closer look at the values we do have.
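Dropping the blank grades before counting could look something like this. It’s a sketch on a made-up toy factor rather than the actual code from this analysis; it assumes, as the summaries above suggest, that missing grades are stored as empty strings rather than `NA`.

```r
# Toy credit grades (made up); "" stands for a missing grade.
toy_grade <- factor(c("", "C", "D", "", "AA", "C", "NC", ""))

# Keep only rows that actually have a grade, then drop the unused "" level
# so it does not show up as an empty bar.
has_grade <- toy_grade != ""
graded <- droplevels(toy_grade[has_grade])

# Counts per grade; these are the bar heights for a bar chart.
counts <- table(graded)
print(counts)
```

Without `droplevels()`, the blank level would still be listed with a count of zero.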
Removing the blanks cuts down our observations significantly, which isn’t great. That being said, we can still make some better observations this way. Looks like the most common rating in our data set is C, and very few borrowers have no credit. A reasonable hypothesis to explain this is that one would be less likely to apply for a loan if one has no credit.
Next I want to look at Stated Monthly Income by Occupation. I’m going to do a quick summary of it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
Just like with some of our credit metrics, we have a max value way outside of our third quartile. I’ll use the 95th percentile as the limit for our graph.
There are some interesting spikes in our dataset. It makes sense that people would generally report round numbers for their income. Shout out to everyone who seems to have gotten a loan and reported 0 income.
Lastly for my univariate exploration I’ll look at occupation data. I’d like to compare occupations in my upcoming bivariate exploration, so it’ll be helpful to have an understanding of the distribution. First I’ll see how many unique occupations we’re working with.
## [1] "There are 68 Occupations"
That’s a lot, so I probably won’t analyze all of them, but I want to get a feel for what the distribution looks like, to see if there’s anything of which I should be aware.
So our first and second largest bars are “Other” and “Professional” respectively. I’ll leave those out of my analysis because they aren’t descriptive enough.
This is a huge dataset with as many variables and observations as anyone could hope for. There’s a lot to explore here, and plenty more interesting things to notice than I had time to look into.
Key Observations: Most of the credit-related metrics have huge outliers. The median number of open revolving accounts is 6. The median number of open credit lines is 9. Most observations have no delinquencies. Most of the data set is missing its credit ratings. Of the credit ratings we have, C is the most common.
The main features in this dataset in which I am interested are the various metrics of credit, namely credit rating. I am interested in seeing which other variables might serve as indicators of credit, such as occupation.
Occupation is the main variable I’ll look into, but others such as loan term and APR should have some correlations.
No, I didn’t create any new variables. This dataset has plenty of them already, so I’ve yet to find it necessary.
Some of the distributions were made unusual by some extreme outliers. The only changes I made were to remove outliers and missing values from certain variables.
For this section I’m going to be making a lot of comparisons around Occupations. I’m going to randomly sample ten occupations, and make sure I don’t get “Other” or “Professional”.
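One way to draw such a sample is sketched below. The helper `sample_occupations`, the seed value, and the toy factor are all hypothetical; this isn’t necessarily how the sample in this report was drawn. The key step is removing the unwanted labels before sampling, so they can never be picked.

```r
# Hypothetical helper: sample n distinct occupations, excluding labels
# that are too vague to be useful (and any missing values).
sample_occupations <- function(occupation, n = 10,
                               exclude = c("Other", "Professional"),
                               seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  candidates <- setdiff(unique(as.character(occupation)), c(exclude, NA))
  sample(candidates, size = min(n, length(candidates)))
}

# Toy occupation column (made up rows, labels taken from this dataset).
toy_occ <- factor(c("Other", "Professional", "Nurse (LPN)", "Chemist",
                    "Executive", "Laborer", "Other"))

picked <- sample_occupations(toy_occ, n = 3, seed = 42)

# Keep only the rows whose occupation was sampled.
toy_rows <- toy_occ[toy_occ %in% picked]
print(picked)
```

Setting a seed is worth doing in a report like this, because otherwise every re-knit would analyze a different set of occupations.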
## [1] Executive Laborer
## [3] Sales - Commission Nurse (LPN)
## [5] Police Officer/Correction Officer Medical Technician
## [7] Chemist Student - College Graduate Student
## [9] Student - College Sophomore Student - College Freshman
## 68 Levels: Accountant/CPA Administrative Assistant Analyst ... Waiter/Waitress
Alright, those are my sample occupations, and we’ll just look at these. I made another data frame with just the observations within the selected occupations.
So if I’d thought about it a bit more I’d have realized that this would be what that plot looked like - I still think it’s interesting though. Every occupation has every credit rating, except for the ones that don’t have No Credit (the double negative is correct in this context) and the students, who don’t have the highest credit ratings.
So this one is much more interesting. At first there were a couple of zero values. Although I’m not an expert in credit scores, I do know zero values aren’t actually attainable, so I removed them from the second plot. You can see somewhat tighter groupings for the students, which makes sense, as a student theoretically would have had less time to get really good credit, or to really ruin it. Other occupations range pretty far. Transparency is applied to this plot such that the darker the dot, the more values there are. You can see how our executives have a higher concentration of higher credit scores.
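Filtering out the impossible zero scores and then summarizing per occupation could be sketched like this. The toy data frame is made up, and using the median as the summary is my choice for the illustration; the summaries that follow report the full five-number summary plus the mean.

```r
# Toy data (made up values); 0 and NA stand for unusable credit scores.
toy <- data.frame(
  Occupation            = c("Executive", "Executive", "Laborer", "Laborer", "Laborer"),
  CreditScoreRangeLower = c(700, 0, 640, 680, NA)
)

# Keep only rows with a real, positive credit score.
valid <- subset(toy, !is.na(CreditScoreRangeLower) & CreditScoreRangeLower > 0)

# Median score per occupation.
by_occ <- tapply(valid$CreditScoreRangeLower, valid$Occupation, median)
print(by_occ)
```

Swapping `median` for `summary` in the `tapply()` call gives a per-occupation table in the same shape as the ones printed next.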
## [1] "Summary of Executives"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 440.0 660.0 700.0 702.9 740.0 880.0
## [1] "Summary of Laborers"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 420.0 640.0 680.0 672.7 700.0 860.0
## [1] "Summary of Sales - Commission"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 440 640 680 682 720 880
I would have expected there to be a bit more variation by occupation here. I would’ve thought that executives would sit further to the right, i.e. more of them would have higher credit scores. As you can see, they have a similar distribution to Sales - Commission and Laborers; they do, however, have a mean credit score about 20 to 30 points higher. There aren’t really a lot of counts for college students, which makes sense since most people don’t apply for this type of loan in college.
Looks about like I expected this time: some people in sales are doing well for themselves, but the average is closer to where the Laborers are. Executives certainly have a broad range, but have more high values than any other column. Still not many students, and they don’t make much, especially freshmen. The only thing that really surprises me here is nurses. Personally I would’ve expected them to be further to the right.
For my last trick, I’d like to look at credit score vs income. I imagine that in general these should be pretty positively correlated. It stands to reason that people who make more money will have better credit. Let’s see if that’s actually true.
There are our outliers again, let’s get rid of our Fancy Pants rich people and clean this plot up a bit.
##
## Pearson's product-moment correlation
##
## data: df_2$StatedMonthlyIncome and df_2$CreditScoreRangeLower
## t = 36.54, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1021433 0.1136511
## sample estimates:
## cor
## 0.1079008
So it looks like income is not the good predictor of credit score that I thought it might be. There are noticeably fewer observations with low credit score and high income (bottom right area), and there’s a noticeable dip in values with high credit score and low income (top left area); overall, however, there are plenty of high-income people with lower credit scores and vice versa. Pearson’s r of about 0.11 is statistically significant with this many observations, but it indicates only a weak positive linear relationship.
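To put that correlation in perspective, here’s a small worked check using only the numbers reported in the test output above. The variable names are mine; the formula for the t statistic of a Pearson correlation, t = r * sqrt(df) / sqrt(1 - r^2), is the standard one.

```r
# Values copied from the cor.test output above.
r        <- 0.1079008  # sample correlation
deg_free <- 113340     # degrees of freedom (n - 2)

# Share of variance in one variable linearly explained by the other.
r_squared <- r^2

# Standard t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

cat(sprintf("r squared = %.4f, t = %.2f\n", r_squared, t_stat))
```

The t value reproduces the one in the output, and r squared comes out at a little over 1%. So the tiny p-value mostly reflects the huge sample size; income explains very little of the variation in credit score.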
So my analysis mostly found that things I thought might be related weren’t. Occupation doesn’t predict credit score very well, and neither does stated monthly income. Occupation and income were the only metrics that made sense in that respect, and even there there’s a lot of variance.
I didn’t really find any strong relationships. To me the most interesting thing I found was the dips in the corners of the plot of credit score vs income.
Let’s make a plot similar to the credit score vs stated monthly income, but this time we’ll color by occupation from our list of ten.
Alright, I’m pretty happy with this plot. We can see where the different occupations stack as measured by Credit Scores and Monthly Income. This plot suffers from overplotting however, so I’m going to facet it to clean it up a bit more.
You can see how students (pink and purple circles) have a fair range of credit scores; some are great, some not so much. Most of them are hanging out toward the low end of the pay scale though. You can see how our executives are better off for the most part, even though there are plenty of them that make less than our commission salespeople.
Next I’m going to make a similar plot but with loan term instead of occupation. Instinctively I want to say that long loan terms will accompany low income and credit scores, but I’ve been wrong before.
This one doesn’t really look like what I thought it would. If anything, it’s darker at the bottom left and gets lighter, i.e. the terms get longer, which is the opposite of what I guessed earlier. I’m 0/3 now, but yay for data correcting erroneous notions.
I thought that perhaps it’d be interesting to see all four variables, faceted by occupation and colored by loan term. My previous observations hold true, and this plot ties it all together.
Seeing occupation, credit score, and income all together was really interesting to me. Although this analysis didn’t end up being exactly about finding predictor variables, I did see how occupation and wages relate to credit score.
The lack of any clear relationship between loan term and the combination of credit score and monthly income was surprising to me. I guess because shorter terms generally cost more upfront, I expected them to exist mostly around low-income individuals, but that didn’t appear to be the case.
This plot from my first univariate analysis stood out to me. It was interesting to see how quickly the number of inquiries declined for most of the population, while a few people still rack up hundreds of them.
I thought this plot was noteworthy. It was interesting to see how incomes were dispersed, particularly the spikes at round-number values. It was also interesting to compare the distributions of the different job types.
This final plot pulls everything together, and shows how Credit Score, Monthly Income, Loan Term, and Occupation all work together. It paints a picture of these elements in an interesting and descriptive way.
How Was This Analysis
This was a really fun dataset to study and learn about. Exploring income data and how it relates to loans was interesting, and there are a lot of subtle things happening that are worth investigating. My biggest struggle with this was probably finding which variables made the most sense to investigate. Having a dataset this large is good in some ways because you have a lot to work with, but it can certainly complicate things and make a concise analysis difficult. Overall I’m happy with what I was able to learn about, and the things I was able to find out about this dataset.
What Would I do Next Time?
There’s really so much more left to explore in this dataset; I could likely spend weeks working on it. One interesting addition for a future analysis would be state data: which states show up the most, which jobs are in which states, which states have more debt, and so on. There are plenty of unanswered questions for next time.